Motif Scan and Match Significance



Statistical Interpretation of Match Score

The matched sequences reported by search programs can be classified as true positives and false positives (the sequences missed by the program are the negatives). A true positive is a sequence that shares similarity with the query because both have evolved (diverged) from a common ancestral sequence, and is thus a true homolog. Similarity can sometimes also be attributed to convergent evolution. A sequence is regarded as a false positive if the observed similarity is attributable to chance. It must be stressed that only biological arguments can decide whether a sequence should be regarded as a true or false positive. Nevertheless, a statistical analysis based on sound principles can help in the decision, because some matches are more likely to have been produced by chance than others.

Similarity-search tools that use profiles or hidden Markov models (HMMs) as queries produce lists of matched sequences that are aligned with the query, either locally or semi-globally. Every match receives a numerical raw score, whose value is guaranteed to meet some local-maximum criterion. Usually only matches with scores greater than some threshold are reported, and every profile or HMM has its own score thresholds for reporting a match. To make the match scores easier to interpret, one generally attempts to "rescale" the score onto a new scale that is common to all predictors and that has a well-defined statistical meaning. Two such scales are the E-value of Pfam HMMs and the normalized score of Prosite profiles. How the rescaling is actually computed is beyond the scope of this page.

The E-value is the number of matches with a score equal to or greater than the observed score that are expected to occur by chance. In other words, the E-value provides an estimate of the number of false positives. The E-value depends on the size of the searched database, as the number of false positives expected above a given score threshold usually increases in proportion to the size of the database. The total number of sequences and the total number of residues are the most frequently used measures of database size.
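The proportionality described above can be illustrated with a minimal Python sketch. The function name rescale_evalue and the database sizes used in the example are hypothetical, and strict linear scaling is an approximation, not a description of how any particular search program computes its E-values.

```python
def rescale_evalue(evalue: float, reported_db_size: float, target_db_size: float) -> float:
    """Rescale an E-value, assuming it grows linearly with database size.

    Both sizes must be expressed in the same unit (sequences or residues).
    """
    if reported_db_size <= 0 or target_db_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * target_db_size / reported_db_size


# Doubling the database doubles the expected number of chance matches.
print(rescale_evalue(0.5, 1_000_000, 2_000_000))  # 1.0
```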

Prosite profiles report normalized scores instead of E-values. The normalized score is defined as the base 10 logarithm of the size (in residues) of the database in which one false positive match is expected to occur by chance. The normalized score is independent of the size of the searched database. The so-called bit scores reported by other database-search programs have a distinct meaning but are also independent of the size of the searched database.

For a given database size of DB_size residues, the normalized score N_score and the E-value are easily interconvertible:

N_score = log10(DB_size) - log10(E-value)

or

E-value = DB_size * 10^(-N_score)

The following table gives some examples of conversions. The calculations are made for release 34 of SwissProt (October 1996), which contained 21'210'388 = 10^7.33 residues (in 59'021 entries), and for all the sequences found in the Hits database in June 2001, which amount to a total of 408'011'338 = 10^8.61 residues (in 1'818'627 entries).

Normalized score   E-value (SwissProt release 34)   E-value (Hits database, June 2001)
 7.0               2.1                              41
 7.5               0.67                             13
 8.0               0.21                             4.1
 8.5               0.067                            1.3
 9.0               0.021                            0.41
 9.5               0.0067                           0.13
10.0               0.0021                           0.041
10.5               0.00067                          0.013
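The conversions in the table can be reproduced with a short Python sketch of the two formulas given earlier. The function and constant names are illustrative; the database sizes are the residue counts quoted above.

```python
import math

SWISSPROT_34_RESIDUES = 21_210_388   # SwissProt release 34, October 1996
HITS_2001_RESIDUES = 408_011_338     # Hits database, June 2001


def evalue_from_nscore(n_score: float, db_size: int) -> float:
    """E-value = DB_size * 10^(-N_score)."""
    return db_size * 10.0 ** (-n_score)


def nscore_from_evalue(evalue: float, db_size: int) -> float:
    """N_score = log10(DB_size) - log10(E-value)."""
    return math.log10(db_size) - math.log10(evalue)


# Print a few rows in the same column order as the table.
for n_score in (7.0, 8.5, 10.0):
    e_swissprot = evalue_from_nscore(n_score, SWISSPROT_34_RESIDUES)
    e_hits = evalue_from_nscore(n_score, HITS_2001_RESIDUES)
    print(f"{n_score:4.1f}  {e_swissprot:.2g}  {e_hits:.2g}")
```

Rounded to two significant digits, the printed values agree with the corresponding rows of the table.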

How to Reach a Decision about Match Significance

The introduction of E-values and normalized scores greatly facilitates the interpretation of raw match scores. To go one step further, one would like to reach a decision about the final status of a match. Can a given match be regarded with some confidence, or is it questionable, i.e. a weak match? Indeed, several cutoffs are supplied with Prosite profiles and Pfam HMMs just for that purpose. The idealized figure below illustrates the problem of defining these cutoffs. Note that the complete list of homologs is a priori unknown in most real situations, which is a serious complication.

No false positive matches should be reported when considering the particular task of automated annotation of protein sequences. This dictates the use of cutoffs placed at a sufficiently high score, just to be on the safe side; this might correspond to the threshold indicated by the high arrow in the above figure. In principle, all matches tagged with ! in the output of the Motif Scan Server should belong to this category. The same cutoffs are used by the InterPro team for the Prosite profiles and the Pfam HMMs, even if the scores of the matches are actually not reported by the InterProScan server (June 2001).

Gene discovery and the detection of remote homologs are other tasks where profiles and HMMs have proved successful. In this perspective, inspection of the matches in the twilight zone, where true and false positives co-exist, has often yielded the most promising results. This would include the scores located between the low and high arrows on the above idealized figure. These weak or questionable matches are tagged with ? in the output of the Motif Scan Server, and they deserve further investigation before being reported elsewhere. Note also that (i) the normalized score is the most helpful measure for evaluating matches in this zone; (ii) the low score threshold is arbitrarily set for most predictors; (iii) the use of meta-motifs is usually an efficient and elegant way to further evaluate weak matches.

Prosite Pattern

No scoring system is involved in the detection of a match by a pattern. It is nevertheless possible to estimate the confidence of a match by considering the number of false positives reported in the Prosite entry. In the output of the Motif Scan Server, a match by a pattern that has zero false positives listed in its Prosite entry is tagged with !. All the other patterns are considered to produce questionable matches and are tagged with ?. This division of the patterns into two categories must be taken with caution.

Frequently Matching Prosite Pattern

The skip flag /SKIP-FLAG=TRUE; is found in the Prosite entry of these patterns. No annotated list of matches is maintained because they produce too many false positives. The matches found with these patterns are only indicative of a possible function and they are tagged with ? in the output of the Motif Scan Server. Independent biological evidence must be considered to confirm the appropriateness of these matches.
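The tagging rules for patterns described in the two sections above can be summarized in a minimal Python sketch. The function name tag_pattern_match and its parameters are hypothetical; this is not the server's actual code, only a restatement of the rules as stated on this page.

```python
def tag_pattern_match(false_positive_count: int, skip_flag: bool = False) -> str:
    """Return '!' for a confident pattern match and '?' for a questionable one.

    false_positive_count: number of false positives listed in the Prosite entry.
    skip_flag: True when the entry carries /SKIP-FLAG=TRUE;
    """
    if skip_flag:
        # Frequently matching patterns are always treated as questionable.
        return "?"
    return "!" if false_positive_count == 0 else "?"


print(tag_pattern_match(0))                  # !
print(tag_pattern_match(3))                  # ?
print(tag_pattern_match(0, skip_flag=True))  # ?
```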

Prosite Profile

Every Prosite profile entry has two cutoffs that permit the easy distinction of strong from weak matches. Matches with scores exceeding the so-called LEVEL=0 cutoff are tagged with ! in the output of the Motif Scan Server. Matches with lower scores are tagged with ?.

For many profiles the LEVEL=0 cutoff is placed at a normalized score of 8.5. For some profiles, however, this cutoff was greatly increased, because those profiles are considered diagnostic for a domain subfamily, and not for all possible homologs.

Profile normalized scores were converted to E-values for a database of 59'021 sequences of 359 residues average length, i.e. release 34 of SwissProt.
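As a rough illustration of the profile rules above, the following Python sketch tags a match from its normalized score and derives the matching E-value. The names are hypothetical, the default cutoff of 8.5 only applies to the many profiles that use it, and treating a score exactly equal to the cutoff as a strong match is an assumption.

```python
# SwissProt release 34: number of entries times average length in residues.
PROFILE_DB_RESIDUES = 59_021 * 359


def tag_profile_match(n_score: float, level0_cutoff: float = 8.5) -> str:
    """Return '!' at or above the LEVEL=0 cutoff (inclusive by assumption), else '?'."""
    return "!" if n_score >= level0_cutoff else "?"


def profile_evalue(n_score: float) -> float:
    """Convert a normalized score to an E-value for the reference database."""
    return PROFILE_DB_RESIDUES * 10.0 ** (-n_score)


for score in (7.9, 8.5, 12.3):
    print(score, tag_profile_match(score), f"{profile_evalue(score):.2g}")
```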

Pfam HMM

The Pfam collection of HMMs was primarily designed for the automated annotation of genome data. In this perspective, it is essential that all false positives be rejected, even at the cost of missing remote homologs. The computation that produces the decision to accept the match(es) of a motif on a protein is quite intricate. The raw score of every match and the sum of the scores of the individual matches are first re-evaluated by the search program to take into account the composition of the matched protein. Then, the acceptance of these matches is based on the two cutoffs declared in the GA field of every Pfam entry; the first one is for the sum of all matches (the per-protein cumulated score), the second one is for the highest-scoring individual match (and an additional cutoff is defined by default for the remaining individual match scores). Matches that satisfy these criteria are tagged with ! in the Motif Scan Server. The other matches, which have an E-value below 1 for release 34 of SwissProt but are not retained by the above-mentioned strategy, are tagged with ?.
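A simplified Python sketch of this decision is given here for orientation only. The function and parameter names are hypothetical, the scores are assumed to be already corrected for protein composition, and the additional default cutoff applied to the remaining individual matches is deliberately left out because its value is not stated on this page.

```python
from typing import Optional, Sequence


def tag_pfam_hit(match_scores: Sequence[float],
                 ga_total: float,
                 ga_best: float,
                 evalue: float) -> Optional[str]:
    """Tag the match(es) of one Pfam HMM on one protein.

    match_scores: composition-corrected scores of the individual matches.
    ga_total: first GA cutoff, applied to the sum of all match scores.
    ga_best: second GA cutoff, applied to the highest individual score.
    evalue: E-value computed for release 34 of SwissProt.
    Returns '!', '?', or None when the hit would not be reported.
    """
    if not match_scores:
        return None
    if sum(match_scores) >= ga_total and max(match_scores) >= ga_best:
        return "!"
    if evalue < 1.0:
        return "?"
    return None


print(tag_pfam_hit([12.0, 9.5], ga_total=20.0, ga_best=10.0, evalue=1e-4))  # !
print(tag_pfam_hit([8.0], ga_total=20.0, ga_best=10.0, evalue=0.3))         # ?
```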

Pfam E-values are reported for a database of 59'021 sequences of 359 residues average length, i.e. release 34 of SwissProt. The same database was used for computing the corresponding normalized scores.


Comments to
Marco Pagni.